(57) ABSTRACT

A platform for implementing interactive voice response 
(IVR) applications over the Internet or other type of network 
includes a speech synthesizer, a grammar generator and a 
speech recognizer. The speech synthesizer generates speech 
which characterizes the structure and content of a web page 
retrieved over the network. The speech is delivered to a user 
via a telephone or other type of audio interface device. The 
grammar generator utilizes textual information parsed from 
the retrieved web page to produce a grammar. The grammar 
is supplied to the speech recognizer and used to interpret 
voice commands and other speech input generated by the 
user. The platform may also include a voice processor which 
determines which of a number of predefined models best 
characterizes a given retrieved page, such that the process of
generating an appropriate verbal description of the page is 
considerably simplified. The speech synthesizer, grammar 
generator, speech recognizer and other elements of the IVR 
platform may be operated by an Internet Service Provider
(ISP), thereby allowing the general Internet population to 
create interactive voice response applications without 
acquiring their own IVR equipment. 
WEB-BASED PLATFORM FOR
INTERACTIVE VOICE RESPONSE (IVR) 

FIELD OF THE INVENTION 

The present invention relates generally to the Internet and 
other computer networks, and more particularly to tech- 
niques for obtaining information over such networks via a 
telephone or other audio interface device. 

BACKGROUND OF THE INVENTION 

The continued growth of the Internet has made it a 
primary source of information on a wide variety of topics. 
Access to the Internet and other types of computer networks 
is typically accomplished via a computer equipped with a 
browser program. The browser program provides a graphi- 
cal user interface which allows a user to request information 
from servers accessible over the network, and to view and 
otherwise process the information so obtained. Techniques 
for extending Internet access to users equipped with only a 
telephone or other similar audio interface device have been 
developed, and are described in, for example, D. L. Atkins 
et al., "Integrated Web and Telephone Service Creation,"
Bell Labs Technical Journal, pp. 19-35, Winter 1997, and J. 
C. Ramming, "PML: A Language Interface to Networked 
Voice Response Units," Workshop on Internet Programming 
Languages, ICCL '98, Loyola University, Chicago, Ill., May
1998, both of which are incorporated by reference herein. 

Users developing Interactive Voice Response (IVR) appli- 
cations to make use of the audio interface techniques 
described in the above references generally must utilize 
costly special-purpose IVR hardware, which can often be 
prohibitively expensive. The expense associated with this 
special-purpose IVR hardware prevents many users, such as
small business owners and individuals, from building IVR 
applications for their web pages. Such users are therefore 
unable to configure their web pages so as to facilitate access 
by telephone or other audio interface device. 

SUMMARY OF THE INVENTION 

The present invention provides apparatus and methods for 
implementing Interactive Voice Response (IVR) applica- 
tions over the Internet or other computer network. An 
illustrative embodiment of the invention is an IVR platform 
which includes a speech synthesizer, a grammar generator 
and a speech recognizer. The speech synthesizer generates 
speech which characterizes the structure and content of a 
web page retrieved over the network. The speech is deliv- 
ered to a user via a telephone or other type of audio interface 
device. The grammar generator utilizes textual information 
parsed from the retrieved web page to produce a grammar. 
The grammar is then supplied to the speech recognizer and 
used to interpret voice commands generated by the user. The 
grammar may also be utilized by the speech synthesizer to 
create phonetic information, such that similar phonemes are 
used in both the speech recognizer and the speech synthe- 
sizer. In appropriate applications, such as name dialing 
directories and other applications having grammars with 
long compilation times, the grammar produced by the gram- 
mar generator may be partially or completely precompiled. 

An IVR platform in accordance with the invention may 
also include other elements, such as, for example, a parser 
which identifies textual information in the retrieved web 
page and delivers the textual information to the grammar 
generator, and a voice processor which also receives web 
page information from the parser. The voice processor uses
this information to determine which of a number of pre- 
defined models best characterizes a given retrieved web 
page. The models are selected to characterize various types 
and arrangements of structure in the web page, such as
section headings, tables, frames, forms and the like, so as to 
simplify the generation of a corresponding verbal descrip- 
tion. 

In accordance with another aspect of the invention, the
speech synthesizer, grammar generator and speech
recognizer, as well as other elements of the IVR platform, 
may be used to implement a dialog system in which a dialog 
is conducted with the user in order to control the output of 
the web page information to the user. A given retrieved web
page may include, for example, text to be read to the user by
the speech synthesizer, a program script for executing opera- 
tions on a host processor, and a hyperlink for each of a set 
of designated spoken responses which may be received from 
the user. The web page may also include one or more
hyperlinks that are to be utilized when the speech recognizer
rejects a given spoken user input as unrecognizable. 

An IVR platform in accordance with the invention may be 
operated by an Internet Service Provider (ISP) or other type 
of service provider. By permitting dialog-based IVR applications
to be built by programming web pages, the invention
opens up a new class of Internet applications to the general 
Internet population. For example, Internet content develop- 
ers are not required to own or directly operate an IVR 
platform if they have access to an IVR platform from an ISP.
This is a drastic departure from conventional approaches to
providing IVR service, which typically require the owner- 
ship of expensive IVR equipment. An ISP with an IVR 
platform system will be able to sell IVR support services to 
the general public at relatively low cost. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a system including a 
web-based interactive voice response (IVR) platform in 
accordance with the invention.

FIG. 2 shows a more detailed view of the web-based IVR 
platform of FIG. 1. 

DETAILED DESCRIPTION OF THE 
INVENTION 

The present invention will be illustrated below in con- 
junction with an exemplary system. It should be understood, 
however, that the invention is not limited to use with any 
particular type of system, network, network communication
protocol or configuration. The term "web page" as used
herein is intended to include a single web page, a set of web 
pages, a web site, and any other type or arrangement of 
information accessible over the World Wide Web, over other 
portions of the Internet, or over other types of communication
networks. The term "platform" as used herein is
intended to include any type of computer-based system or 
other type of system which includes hardware and/or soft- 
ware elements configured to provide one or more of the 
interactive voice response functions described herein. 

1. System Description

FIG. 1 shows an exemplary information retrieval system 
100 in accordance with an illustrative embodiment of the 
invention. The system 100 includes a web-based IVR platform
102, a network 104, a number of servers 106-i,
i=1, 2, . . . N, and an audio interface device 108. The network
104 may represent the Internet, an intranet, a local area or 
wide area network, a cable network, satellite network, as 
well as combinations or portions of these and other networks.
Communications between the IVR platform 102 and
one or more of the servers 106-i may be via connections
established over the network 104 in a conventional manner
using the Transmission Control Protocol/Internet Protocol
(TCP/IP) standard or other suitable communication protocol(s).
The servers 106-i may each represent a computer or
group of computers arranged in a conventional manner to
process information requests received over network 104.
The audio interface device 108 may be, for example, a
telephone, a television set-top box, a computer equipped
with telephony features, or any other device capable of
receiving and/or transmitting audio information. The audio
interface device 108 communicates with the IVR platform
102 via a network 109 which may be, for example, a public
switched telephone network (PSTN), a cellular telephone
network or other type of wireless network, a data network
such as the Internet, or various combinations or portions of
these or other networks. Although shown as separate networks
in the illustrative embodiment of FIG. 1, the networks
104 and 109 may be the same network, or different portions
of the same network, in alternative embodiments.

FIG. 2 shows the IVR platform 102 in greater detail. The
IVR platform 102 includes a web browser 110 which is
operative to retrieve web pages or other information from
one or more of the servers 106-i via network 104. The web
browser 110 may be a conventional commercially-available
web browser, or a special-purpose browser designed for use
with audio interface device 108. For example, the web
browser 110 may support only a subset of the typical web
browser functions since in the illustrative embodiment it
does not need to display any visual information, i.e., it does
not need to process any image or video data. The browser
110 retrieves text, audio and other information from one or
more of the servers 106 via the network 104. The browser
110 may be configured to play back the retrieved audio in a
conventional manner, such that the playback audio is supplied
to the audio interface device 108 via the network 109.
The browser 110 delivers the retrieved text and other information
to an HTML parser 112. The parser 112 performs
preprocessing operations which configure the retrieved text
so as to facilitate subsequent interpretation by a voice
processor 114 and a grammar generator 120. The retrieved
text is assumed in the illustrative embodiment to be in a
Hyper Text Markup Language (HTML) format, but could be
in other suitable format(s) in other embodiments. For
example, the IVR platform 102 may also be configured to
process web page information in a Phone Markup Language
(PML) format. PML is a language specifically designed to
build telephone-based control into HTML pages, and including
PML capability in the IVR platform allows it to better
support a wide variety of web-based IVR applications.

The voice processor 114 performs analysis of the text and
other web page information supplied by the HTML parser
112, and generates corresponding verbal descriptions which
are supplied to a text-to-speech (TTS) synthesizer 116. The
HTML parser 112, voice processor 114 and TTS synthesizer
116 transform the text and other web page information into
speech which is delivered to the audio interface device 108
via the network 109. The grammar generator 120 utilizes the
text and other web page information received from the
HTML parser 112 to produce one or more speech recognition
grammars which are delivered to a speech recognizer
122. The speech recognizer 122 receives speech input generated
by the audio interface device 108, and utilizes the
grammar produced by grammar generator 120 to recognize
words in the speech. Appropriate indicators of the recognized
words are then supplied to the spoken command
interpreter 124, which interprets the indicators to generate
corresponding command signals. The command signals are
supplied to a processor 130 which controls the operation of
at least a portion of the IVR platform 102. The IVR platform
102 further includes a dual-tone multiple frequency (DTMF)
decoder 126 which decodes DTMF signals received in
platform 102 from the audio interface device 108 via the
network 109. Such signals may be generated, for example,
in response to selections offered in the audio playback or
speech supplied from IVR platform 102 to the audio interface
device 108. The decoded DTMF information is supplied
from the decoder 126 to the processor 130.

The processor 130 interacts with a memory 132, and with 
the web browser 110. The processor 130 may be a 
microprocessor, central processing unit, application-specific 
integrated circuit (ASIC) or any other digital data processor 
which directs the operation of at least a portion of the IVR 
platform 102. For example, the processor 130 may be a 
processor in a computer which implements the web browser 
110 or one or more of the other elements of the IVR platform 
102. The memory 132 may represent an electronic memory, 
a magnetic memory, an optical memory or any other 
memory associated with the IVR platform 102, as well as 
portions or combinations of these and other memories. For 
example, memory 132 may be an electronic memory of a 
computer which, as noted above, may also include processor 
130. In other embodiments, the IVR platform 102 may be 
implemented using several interconnected computers as 
well as other arrangements of suitable processing devices. 

The TTS synthesizer 116, speech recognizer 122, spoken 
command interpreter 124, DTMF decoder 126, processor 
130 and memory 132, as well as other elements of IVR 
platform 102, may be elements of conventional systems that 
are part of or include a base platform such as the Intuity/ 
Conversant system or Lucent Speech Processing System 
(LSPS), both from Lucent Technologies Inc. of Murray Hill, 
NJ. As previously noted, the IVR platform 102 may be 
implemented using one or more personal computers 
equipped with commercially available speech and telephony 
system boards. It should be noted that the dotted line 
connections between platform 102 and audio interface
device 108 in FIG. 2 may represent, e.g., a single connection 
established through the network 109, such as a telephone 
line connection established through a PSTN or a cellular or 
other type of wireless network. 

The IVR platform 102 in an illustrative embodiment may 
be configured to respond to either voice commands or 
DTMF signals in one of the following three modes: (1) 
DTMF only, in which descriptions include phrases to 
associate, e.g., button numbers on audio interface 108 with 
information available via a retrieved web page; (2) voice 
only, where a concise description of a retrieved web page is 
given in the form of speech generated by TTS synthesizer 
116; and (3) both DTMF and voice, where both speech 
description and phrases identifying button numbers and the 
like may be given. The DTMF only mode may be desirable 
when operating audio interface 108 in a noisy environment, 
such as a busy city street or in a crowd of people, because 
background noise might be interpreted as voice commands 
by IVR platform 102. The voice only mode is often most 
desirable, because it tends to produce the most rapid page 
descriptions. 
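
By way of illustration only, the three response modes described above may be sketched as follows. The names `Mode`, `describe_links` and `accepts`, as well as the wording of the prompts, are hypothetical and are not taken from the platform itself; the sketch merely shows how a description could carry button-number phrases, spoken titles, or both, depending on the selected mode.

```python
from enum import Enum


class Mode(Enum):
    DTMF_ONLY = "dtmf"
    VOICE_ONLY = "voice"
    BOTH = "both"


def describe_links(titles, mode):
    """Return the phrases to be spoken for a list of link titles.

    The prompt wording is illustrative only.
    """
    phrases = []
    for number, title in enumerate(titles, start=1):
        if mode is Mode.VOICE_ONLY:
            # Concise description: the title itself is the spoken command.
            phrases.append(title)
        elif mode is Mode.DTMF_ONLY:
            # Associate a button number with the available information.
            phrases.append(f"For {title}, press {number}.")
        else:
            # Both the spoken title and the button number are offered.
            phrases.append(f"For {title}, say {title} or press {number}.")
    return phrases


def accepts(mode, input_kind):
    """Decide whether an input of kind 'voice' or 'dtmf' is acted upon."""
    if mode is Mode.BOTH:
        return input_kind in ("voice", "dtmf")
    return input_kind == mode.value
```

In the DTMF only mode of this sketch, `accepts(Mode.DTMF_ONLY, "voice")` is false, so background noise picked up as speech would simply be ignored.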

The voice processor 114 in IVR platform 102 takes the 
output from the HTML parser 112 and further analyzes the 
corresponding retrieved HTML web page to identify struc- 
ture such as, for example, section headings, tables, frames, 






US 6,587,822 B2 


and forms. The voice processor 114 in conjunction with TTS
synthesizer 116 then generates a corresponding verbal
description of the page. In general, such a verbal description
may include speech output corresponding to the page text,
along with descriptions of sizes, locations and possibly other
information about images and other items on the page.
Depending on the preference of the user, the page can be
described by content or by structure. For example, a user
may be permitted to choose either a description mode or an
inspection mode. In an example of the description mode, the
IVR platform 102 will immediately start to describe a new
web page upon retrieval using the various TTS voices to
indicate various special elements of the page. The user can
command IVR platform 102 to pause, back up, skip ahead,
etc., in a manner similar to controlling an audio tape player,
except that content elements such as sentences and paragraphs
can be skipped.

In an example of the inspection mode, IVR platform 102
will briefly describe the structure of the page and wait for
spoken inspection commands. Inspection commands allow
the user to "descend" into elements of the page to obtain
greater detail than might normally be obtained in the
description mode. For example, each element of a table can
be inspected individually. If a given table element also has
structure, the user can descend into this structure recursively.
The inspection mode uses appropriate dialog to provide the
user with flexibility in controlling the way information is
delivered. The user may be given control over the TTS
speaking rate, and the ability to assign various TTS voices
to certain HTML element types such as section headings,
hyperlink titles, etc. In addition, section headings may be
rendered in a different voice from ordinary text. If section
headings are detected, initially only the headings will be
described to the user. Voice commands can then be used to
instruct IVR platform 102 to move to a particular section,
i.e., the user can speak the heading title to instruct IVR
platform 102 to move to that section.

The above-noted tables may be used for page layout only
or may be true tabulations. The page analysis process
implemented in HTML parser 112 and voice processor 114
determines which is most likely and generates descriptions
accordingly. True tabulations are described as tables. Tables
used for page layout purposes are generally not described
explicitly, but table element locations may be described if
deemed important. An inspection mode can be used to
override this table treatment when, e.g., IVR platform 102
hides table descriptions. Frames can also be handled in a
number of ways, including a full page description method
and a frame focus method. The full page description method
merges the information from all frames into a single context
that allows the user to verbally address all elements independently
of the frames. The frame focus method allows the
user to specify a frame to be described or inspected, such
that voice commands are focused on that frame. Forms may
be described, for example, in terms of field title labels, with
the fields addressable by speaking field titles. In addition,
general items can be entered into form fields by spelling, and
the above-described inspection mode can be used to obtain
menu choices.

The grammar generator 120 in IVR platform 102 automatically
generates speech recognition grammar and
vocabulary from the HTML of a retrieved web page. This is
an important feature of IVR platform 102 that makes it
useful for building IVR applications. The parsed HTML is
analyzed in grammar generator 120 for section titles, hyperlinks
and other indicators that are to be converted to speech.
Grammar generator 120 then constructs a subgrammar for
each indicator by generating all possible ways of speaking
subsets of the indicator. All other voice commands are then
combined with the subgrammar and a complete grammar is
compiled into an optimized finite-state network. This network
is loaded into the speech recognizer 122 to constrain
the possible sequences of words that can be recognized.
Other types of grammar generation could also be used in
conjunction with the invention.

A byproduct of the illustrative grammar generation process
implemented in grammar generator 120 is the creation
of a list of vocabulary words. This list may be
processed by the TTS synthesizer 116 to create a list of
phonetic transcriptions in symbolic form. The same phonemes
are used in both the speech recognizer 122 and
the TTS synthesizer 116. The symbolic phonetic
descriptions, once loaded into the recognizer 122, tell the
recognizer how the vocabulary words are pronounced, thus
making it possible for the IVR platform 102 to recognize
virtually any spoken word.

In normal operation, the IVR platform 102 describes
retrieved web pages to the user via the speech output of the
TTS synthesizer 116. The user controls the IVR platform
102 by speaking over the TTS synthesizer output, thus
"barging in." Echo cancellation may be used to remove TTS
synthesizer output from the speech recognition input so that
speech recognition will be unaffected by the TTS output.
When the user speaks for a sufficiently long period, the TTS
output may be interrupted, such that speech recognition can
be more effectively performed, and the speech recognizer
output is interpreted into an IVR platform command.

As part of the grammar generation process, voice command
interpretation tables may be established for use later in
the interpretation phase. For example, a stored table of
possible command phrases may be used to associate computer
instructions with each phrase. Typically, no ambiguous
browser command phrases are defined. In the case of processing
a hyperlink, the Uniform Resource Locator (URL)
of the hyperlink is associated with all possible subsets of the
hyperlink title. Section titles can be handled in a similar
manner. Subsequently, when a title word is spoken,
the associated URL(s) can be retrieved.

It is possible that more than one URL and/or browser
command will be retrieved when the spoken title words are
not unique. In such a case, a simple dialog may be initiated
such that the user is given a choice of full title descriptions
that can be selected either by spoken number or by speaking
an unambiguous title phrase. If the phrase is still ambiguous,
a new and possibly smaller list of choices may be given. The
user can back up at any time if the selection process has not
yielded the desired choices. This allows the user to refine the
list and converge on one choice.

2. Processing Details

Various aspects of the voice processing and other operations
performed in the IVR platform 102 of FIG. 2 will be
described in greater detail below.

2.1 HTML Parsing

As noted above, the HTML parser 112 parses HTML in
retrieved web pages for the purposes of facilitating production
of speech output and generation of grammar. The
HTML parsing process is purposely kept relatively simple.
Full context-free parsing is not required and may even be
undesirable because, while HTML is typically well
structured, many real-world HTML pages include software
bugs and other errors. Therefore, relying on the HTML
standard and enforcing a strict context-free parsing will
often be counterproductive.

Proper generation of speech output requires an explicit
representation of the structure of a given web page. The
HTML parsing process is used to obtain a representation of
this structure. Important elements such as frames, tables and
forms are identified and their scope within their containing
elements is analyzed. For example, a form can be contained
in a table, which in turn can be contained in a frame, etc. A
critical part of this analysis is to determine the structural
significance of these elements as opposed to their graphical
significance. For example, several levels of tables may be
used in a web page for the sole purpose of alignment and/or
generating attractive graphics around various elements. In
such a case, the entire set of tables may be structurally
equivalent to a simple list. Proper voice rendering in this
case requires that the tables be ignored and only the bottom-
level elements be spoken, i.e., described to the user as a list.
In the case of a "real" data table, the table would instead be
described as such.

The parsing process itself presents two significant problems
which are addressed below. The first is that various
relationships must be derived from the HTML and explicitly
represented, whereas a normal browser replaces the explicit
representation with a rendered page image. Thus, the representation
must explicitly know, e.g., which words are bold,
italic and part of a link title, as opposed, e.g., to those that
are italic and part of an H3 title. Any particular combination
could have significance in showing relevant structure. This
problem is addressed in the HTML parser 112 by "rendering"
the page into data structures. Each string of text with
uniform attributes has an attribute descriptor that specifies
all the features, e.g., bold, link text, heading level,
etc., currently active in that string. This does not itself
provide a hierarchical structure. However, such a structure,
although generally not necessary at the HTML source level,
can be generated by examining the tag organization.
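
As a minimal sketch of this "rendering" into data structures, and not the parser 112 itself, the following uses the `html.parser` module of the Python standard library to split a page into text runs, each paired with a descriptor listing the features active in that run. The class name `RunCollector`, the function `render_runs`, and the particular tag-to-feature mapping are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that contribute a feature to the attribute descriptor (an assumed subset).
FEATURES = {"b": "bold", "strong": "bold", "i": "italic", "em": "italic", "a": "link"}
HEADINGS = {f"h{n}": f"heading{n}" for n in range(1, 7)}


class RunCollector(HTMLParser):
    """Collects (text, descriptor) runs; the descriptor is the set of active features."""

    def __init__(self):
        super().__init__()
        self.active = []  # features currently open, innermost last
        self.runs = []    # list of (text, frozenset of features)

    def handle_starttag(self, tag, attrs):
        feature = FEATURES.get(tag) or HEADINGS.get(tag)
        if feature:
            self.active.append(feature)

    def handle_endtag(self, tag):
        feature = FEATURES.get(tag) or HEADINGS.get(tag)
        # Tolerate unbalanced markup: a close tag with no matching open is ignored.
        if feature and feature in self.active:
            # Remove the innermost matching feature.
            index = len(self.active) - 1 - self.active[::-1].index(feature)
            del self.active[index]

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((text, frozenset(self.active)))


def render_runs(html):
    """Return the list of (text, descriptor) runs for an HTML string."""
    collector = RunCollector()
    collector.feed(html)
    collector.close()
    return collector.runs
```

With such descriptors, a run that is italic inside an H3 heading can be told apart from a run that is bold, italic and part of a link title, which is the distinction the representation is required to make.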

The second parsing problem is that HTML pages often
include errors. This means that a document that appears
well-structured on the screen may be poorly structured at the
source level. The HTML parser 112 must analyze the
improperly structured source and determine a well-formed
structure that is equivalent to what the user would see on the
screen. This can be tricky in some common cases, such as a
missing <TD> within a table, which can cause a conventional
browser to discard the element. This is particularly
troublesome for cases involving form elements. This problem
should become less significant as automated tools
become more widely used. However, such tools are also
likely to lead to a proliferation of excess HTML, e.g.,
multi-level tables used for layout.

As previously noted, the grammar generation process
requires extracting the hyperlink titles, and saving the URLs
from the page. Any so-called alternative or "ALT" fields,
intended for use with browsers which have no image
capabilities, may also be extracted as part of this process. In
addition, certain other text such as section headings can be
included in the speech grammars. The parsing operations
required to do this extraction can be implemented using
conventional regular expression parsing.
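
The extraction step, together with the association of a URL with the subsets of a hyperlink title noted earlier, may be sketched as follows. This is an illustration under stated assumptions rather than the grammar generator 120: the regular expressions handle only simple, quoted attributes, "subsets" are taken to mean the title words in their original order, and the names `extract_links`, `spoken_subsets` and `build_phrase_table` are hypothetical.

```python
import re
from itertools import combinations

# Conventional regular expressions; adequate for simple pages, not a full HTML parser.
LINK_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)
ALT_RE = re.compile(r'<img\s[^>]*?alt\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return (title, url) pairs; an image-only link falls back to its ALT text."""
    links = []
    for url, inner in LINK_RE.findall(html):
        title = " ".join(TAG_RE.sub(" ", inner).split())
        if not title:
            alt = ALT_RE.search(inner)
            title = alt.group(1).strip() if alt else ""
        if title:
            links.append((title, url))
    return links


def spoken_subsets(title):
    """All non-empty subsets of the title words, kept in their original order.

    The number of phrases grows as 2**n - 1, so long titles would need a cap.
    """
    words = title.lower().split()
    phrases = set()
    for size in range(1, len(words) + 1):
        for picked in combinations(words, size):
            phrases.add(" ".join(picked))
    return phrases


def build_phrase_table(links):
    """Map each speakable phrase to the list of URLs it may refer to."""
    table = {}
    for title, url in links:
        for phrase in spoken_subsets(title):
            urls = table.setdefault(phrase, [])
            if url not in urls:
                urls.append(url)
    return table
```

A phrase that maps to more than one URL in the resulting table, such as a word shared by two link titles, corresponds to the ambiguous case in which the user is offered a choice of full titles.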
2.2 Verbal Rendering 

The web page description generated in the IVR platform
102 is referred to herein as a verbal rendering of the web
page. In an illustrative embodiment, the user can be permitted
to decide whether to have automatic presentation of the
title of the page. If the user has selected automatic presentation
of the page title, the title will be stated to the user. The
verbal rendering will then continue with either a description
of the page content or a description of the page structure,
again depending, e.g., on previously-established user preferences.
Typically, the simpler of these two approaches is the
structural page description.
As previously noted, two modes of page description
operation may be provided: a description mode and an 
inspection mode. In the description mode, the IVR platform 
will continue to render the page until instructed otherwise or 
the description is complete. The inspection mode gives the 
user the initiative such that the user can ask questions and 
get specific answers. Utilizing the inspection mode allows 
the user to descend recursively into structural elements of 
the page. The user can switch between the description and 
inspection modes under voice control. 
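
The recursive descent available in the inspection mode may be pictured with a small tree of page elements. The following is a sketch only; the `Element` structure, the `summarize` wording and the index-based `descend` helper are hypothetical and stand in for whatever representation and spoken dialog an actual platform would use.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str                      # e.g. "page", "table", "form", "cell"
    label: str = ""                # text spoken for a leaf element
    children: list = field(default_factory=list)


def summarize(element):
    """One-level description: name the children without describing their contents."""
    if not element.children:
        return element.label or element.kind
    parts = ", ".join(child.kind for child in element.children)
    return f"{element.kind} containing {parts}"


def descend(element, path):
    """Follow a sequence of child indexes, one per 'descend' command."""
    for index in path:
        element = element.children[index]
    return element
```

Each inspection command would lengthen the path by one step, so that a table inside the page, and then a cell inside that table, can be summarized in turn.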
2.2.1 Structure Description 

The page structure is generally described in terms of the 
placement of elements like images, tables and forms. In the 
inspection mode, the user will typically get a top-down 
description with options to open various elements. Consider 
as an example a simple web page made of three frames: a
title/information frame across the top, an index bar down the 
side, and a main page. A top-level description of this page 
might be "a title frame, index frame and a page." In this case, 
the user would specify a focus to one of the three areas for 
further description. During navigation, links in the title 
and/or index frames would be available at all times or only 
on request, based on user preference. Certain other common 
features, such as a single-entry search form, may also be
described as a top-level layout item even if not in a separate 
frame. If the page contains a search form, the page could be 
described as "a title frame, index frame and a page with a 
search form." 

Description of the main page may be based on apparent 
structure. For example, if there are four section entries, i.e., 
<H1> entries, on the page, then the description would be "a 
page with four sections." The section headers, i.e., <H1>
contents, plus "top of page" would be available for speaking 
to jump to that section. If the user says nothing, then the 
system either waits or starts with the first section, based on 
user preference. Note that other entities can be the basis for 
section breakdown. For example, a page with several lists, 
each preceded by a short paragraph of plain text, could be
broken down into one section per list, with the apparent 
heading paragraph being spoken to the user. 
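
A minimal sketch of the <H1>-based breakdown follows, assuming a simple regular expression is sufficient to find the section entries. The function names `section_headers` and `describe_page` are hypothetical, and the count is spoken here as a numeral rather than a word.

```python
import re

H1_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def section_headers(html):
    """Return the text of each <H1> entry, with nested markup removed."""
    return [" ".join(TAG_RE.sub(" ", inner).split()) for inner in H1_RE.findall(html)]


def describe_page(html):
    """Return (spoken description, phrases the user may speak to jump)."""
    headers = section_headers(html)
    count = len(headers)
    if count == 0:
        description = "a page"
    elif count == 1:
        description = "a page with one section"
    else:
        description = f"a page with {count} sections"
    # "top of page" is always available in addition to the section headers.
    jump_targets = ["top of page"] + headers
    return description, jump_targets
```

The returned jump targets could be added to the speech grammar so that speaking a header moves the reading position to that section.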

Description of a section may also be done based on 
apparent structure. If the section is plain text, then the 
number of paragraphs is announced and speaking begins, 
with navigation between paragraphs supported. Subsection 
breakdown can be performed in a similar manner, based on 
the presence of lower-level headers or bold lines that appear 
to be used as section headers. This subsection analysis will 
probably not go past this second level as the user may be 
unable to keep track of position with more levels. All other 
information could be read sequentially. 

If the page includes a table, a determination is made as to 
its purpose. Examples of different purposes include 
graphics, alignment, or data. Graphics indicates that the 
table is only there to get a particular background or border, 
and such tables are ignored. The difference between align- 
ment and data is that in an alignment table the contents are 
inherently one-dimensional whereas in a data table the 
contents are arranged in a two-dimensional array. The contents
of an alignment table are either treated as a list or
ignored, based on whether significant alignment would be 
apparent to a viewer. A data table is described as such, with 
the number of rows and columns announced, and an attempt 
made to locate the row and column headers. Navigation 
based on the two-dimensional structure is available. 
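
The determination of a table's purpose is not specified in detail above, so the following is only one possible heuristic, offered as an assumption: a table with at most one non-empty cell is taken to be graphics, a table whose non-empty cells form a single row or a single column is taken to be alignment, and anything else is taken to be data. The names `classify_table` and `describe_table` are hypothetical, and alignment tables are always rendered as lists in this sketch.

```python
def classify_table(rows):
    """Classify a table given as a list of rows, each a list of cell strings."""
    filled = [[cell.strip() for cell in row if cell.strip()] for row in rows]
    filled = [row for row in filled if row]
    cell_count = sum(len(row) for row in filled)
    if cell_count <= 1:
        return "graphics"      # decoration only: background or border
    widest = max(len(row) for row in filled)
    if len(filled) == 1 or widest == 1:
        return "alignment"     # contents are inherently one-dimensional
    return "data"              # contents form a two-dimensional array


def describe_table(rows):
    """Return the spoken description, or an empty string if the table is ignored."""
    kind = classify_table(rows)
    if kind == "graphics":
        return ""
    if kind == "alignment":
        items = [cell.strip() for row in rows for cell in row if cell.strip()]
        return f"a list of {len(items)} items"
    columns = max(len(row) for row in rows)
    return f"a table with {len(rows)} rows and {columns} columns"
```

A fuller treatment would also attempt to locate the row and column headers of a data table, which this sketch omits.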

Form description depends on the relative size of the form 
within a page. One single-entry form may be handled in the 
manner described above. A larger form that appears to be 
only part of a page may be announced as such but is
generally accessed as its elements appear in the reading. 
Direct navigation is possible based on form number and 
element number. Finally, a page that is mostly a form is 
treated as a form rather than a page. An attempt is made to 
locate the name of each entry to aid in description and direct 
navigation. Also note that a section, subsection or other 
localized form within a page may be treated in a similar 
manner. This introduces modal processing where once the 
form is "entered," then navigation is form-based rather than 
paragraph or section based, until the form is "exited," i.e., 
submitted or skipped. 
2.2.2 Content Description 

Page content is described to the extent possible by using 
IVR platform 102 to synthesize text on the page and describe
the known content of images, tables, forms and other 
structure. More specifically, designated types of speech may 
be generated for each of the various HTML elements, e.g., 
for hyperlink titles, bold text, form field labels, and other 
elements useful in navigation. The designated types of 
speech can be user-defined. 
2.3 Web Page Analysis 

In accordance with the invention, web page analysis 
carried out in the IVR platform 102 attempts to fit a given 
web page to one of several predefined page models, with a 
default top-down strategy used for pages that do not fit. The 
objective is to maximize user comprehension of pages by 
designing models that have an easy-to-remember structure,
i.e., we want to prevent a user from getting lost and make it 
easy to locate relevant parts of a page. For this reason the 
models may be made inherently simple and mostly sequen- 
tial with minimum hierarchy. Analysis consists of the two 
steps of identifying the best model and then fitting the page 
content to the model parts. Navigation options may then be 
partly controlled by the model. This should simplify use for 
experienced users because the model can be announced, 
thereby signaling the optimum navigation strategy. 

In the illustrative embodiment, three levels of models are 
used: frame, page and section. This is because pages can 
change within otherwise constant frames. We want to model 
the frame layout separately because it can remain constant, 
so use of the frame model can simplify navigation. In 
general, most section models may be implemented as page 
models applied to a single section. The following is an 
exemplary set of frame models: 

1. Single frame or no frames. In this case, no mention of 
frames is made, simply that there is "a page." 

2. Main page and auxiliary. There is a single main frame 
for the page and surrounding frames for constant 
material, such as a header, index bar or search form. 
The example given above fits this model. 

3. Split-screen. This means that the multiple frames are all
logically part of the same page, which simply permits 
different areas to be visible at the same time while 
others are scrolled. The difference is that some of the 
frames are intended to remain constant while others 
switch page contents. Note that identifying this model 
can be difficult without an embedded hint. 

4. Multi-page. This is a catch-all model for all multi-frame 
layouts that do not fit any other model. In this case, it 
is not clear whether the frames remain related or which 
are more constant than others. An example would be 
two frames that each take half of the total screen, 
without any embedded hint that one of the other models 
fits. 

Each page within a frame set is then matched against a set 
of page models, although the specified frame model can 
imply that certain frames contain certain types of pages. The
following is an exemplary set of page models: 

1. Title area. This model applies only to a page in a 
title-area frame. No navigation except top to bottom
reading applies. Links and limited forms are permis- 
sible. 

2. Index area. This model applies to a frame of index 
links. It is treated as a list, or a set of lists if headers are
apparent. Navigation is top to bottom or to a header. A 
simple form is permissible, which can be directly 
navigated to by the user. 

3. Form. This model indicates that the entire page consists 
mostly of a form. All navigation is customized for 
forms. This can be a main or auxiliary page, and also 
applies to sections. 

4. Plain page. The page has no detectable structure beyond 
paragraphs, if even that. Reading is top to bottom with 
paragraph navigation. This also applies to sections. 

5. List. The page consists mostly of a list. Also permis- 
sible is header and trailer material. Note that the list can 
be made of structures besides an <OL> or <UL>, such 
as tables. This also applies to sections or isolated lists. 

6. Table. The page consists mostly of a true table, plus 
optional header and trailer material. The table structure 
is described in terms of rows, columns and headers, and 
navigation based on this structure is available, e.g., 
"read row two." This also applies to sections or isolated 
tables. 

7. Image. This means that the page is mostly an image, 
possibly with a caption or title. This implies that it is 
apparently not really just a list in bitmap form. This also 
applies to sections or isolated images. 

8. Slide table. This is a list of images, possibly
two-dimensional, optionally with captions. A two-dimensional
list with apparent row and column headers is a
table whose contents are images, whereas without these 
headers it is a slide table. Note that an apparent slide 
table may really be a command list where bitmaps are 
used instead of text, although this is a difficult distinc- 
tion to make. 

9. Sectioned page. This model indicates that the page is 
broken into a number of top-level sections by a set of 
<H1> or other entries. Navigation to individual sec- 
tions is supported, and the section header list can be 
requested. This is also carried out to one additional 
subsection level. Subsections are only available within 
the current section. 

10. Multi-sectioned page. This is a special case of the 
sectioned page where there are more than two levels 
but there is a strict hierarchical numbering scheme, 
such as "Section 1.A.4." These section numbers are
used for navigation and are globally available. The 
headers are also available within the active section tree. 
The difference with the sectioned page is that without 
the strict numbering, sectioning is not done past two 
levels due to the probability of confusion. 

It should be emphasized that the frame, page and section 
models described above are examples only, and a subset of 
these models, as well as combinations of these and other 
models, may be used in a given embodiment of the inven- 
tion. 
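
The two analysis steps described above (identify the best model, then fall back to the default top-down strategy when nothing fits) can be sketched as follows. This is a minimal illustration only: the `PageStats` fields, the 0.5 "mostly" threshold and the two-header rule for sectioned pages are assumptions made for the example, since the text does not specify how the fit is scored.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical summary counts extracted from a parsed page."""
    form_elements: int = 0
    list_items: int = 0
    table_cells: int = 0
    has_table_headers: bool = False
    images: int = 0
    top_level_headers: int = 0
    paragraphs: int = 0


def choose_page_model(stats: PageStats, threshold: float = 0.5) -> str:
    """Return the name of the best-fitting page model, or "top-down"."""
    total = (stats.form_elements + stats.list_items + stats.table_cells
             + stats.images + stats.paragraphs)
    if total == 0:
        return "top-down"
    # Assumed rule: two or more top-level headers mark a sectioned page.
    if stats.top_level_headers >= 2:
        return "sectioned page"
    # "Mostly X" models: score each by its share of the page's items.
    shares = {
        "form": stats.form_elements / total,
        "list": stats.list_items / total,
        # Only a table with headers counts as a "true" table here.
        "table": stats.table_cells / total if stats.has_table_headers else 0.0,
        "image": stats.images / total,
        "plain page": stats.paragraphs / total,
    }
    best = max(shares, key=shares.get)
    # No model dominates: use the default top-down description.
    return best if shares[best] > threshold else "top-down"
```

Once a model name is returned, it could be announced to the user and used to select the navigation commands that apply.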

2.3.1 Images and Text 

In the illustrative embodiment, paragraphs are generally 
read top to bottom, with repeat and skip commands being 
available for navigation. Paragraphs in a section can be 
optionally numbered for quick navigation. Almost any non-text
item will start a new paragraph. The main embedded
items are links, font changes and images. Images are considered
embedded if the text flows around them, but are
considered separate paragraphs if they stand alone on a
given "line" of the page. Embedded links may be read in a
different voice. Font changes are normally ignored, but user
preferences can be set to assign different voices to them. A
paragraph with embedded images may be announced as such
before any of its textual content is read. Images can be
described, e.g., by caption, and the request of a particular
image may be done by number, with numbering done in
row-major order. Typically, no mention of these images is
made while reading the text. Isolated images, e.g., image-only
paragraphs or table elements, may be described, e.g., as
"an image captioned . . ." and possibly with the size
announced.

2.3.2 Tables 

In accordance with the invention, tables are analyzed to
classify their purpose. Tables with a single element are
generally ignored and their element used without regard to
the table. Tables with row and/or column headers are generally
classified as data tables and described and navigated
as such. All other tables are examined for a fit to various
models. An exemplary set of table models may be as
follows: A table with two elements, one of which is an
image, is taken to be an image and title combination. This
becomes an "image" and the table itself is ignored. A table
whose elements are mostly form elements is taken to be a
form. The table structure is used to associate titles with
elements and to establish next/previous relationships but is
otherwise not mentioned to the user. A table whose elements
are plain text or links is taken as a list.
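
The table rules above can be read as an ordered chain of tests. The sketch below is one hedged interpretation: the `Table` structure and the cell-kind strings are hypothetical, and "mostly form elements" is assumed to mean more than half of the cells.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Table:
    """Hypothetical parsed table: one kind string per cell, row-major."""
    cells: List[str] = field(default_factory=list)  # "text", "link", "image", "form"
    has_headers: bool = False


def classify_table(t: Table) -> str:
    n = len(t.cells)
    if n <= 1:
        return "ignore"        # single element: use it, drop the table
    if t.has_headers:
        return "data"          # row and/or column headers present
    if n == 2 and "image" in t.cells:
        return "image"         # image and title combination
    form_count = sum(1 for c in t.cells if c == "form")
    if form_count > n / 2:     # assumed meaning of "mostly"
        return "form"
    if all(c in ("text", "link") for c in t.cells):
        return "list"
    return "unclassified"      # no exemplary model fits
```

For instance, `classify_table(Table(["image", "text"]))` yields `"image"`, so the table itself would not be mentioned to the user.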

2.3.3 Forms 

In the illustrative embodiment, forms may be classified as
either "embedded" or "plain." An embedded form with a
single element or other type of small form may be viewed as
an entry area, e.g., a search entry. These types of forms can
be treated as top-level items, e.g., search, or as plain
paragraphs, e.g., a "give us your comments" element at the
end of a page. All other forms are treated as plain forms. The
main point of the form analysis is to enable description and
form-specific navigation. We generally want to classify all
elements in a form as to whether they are "global descriptive"
or are a title, instructions, etc. associated with a
particular element. We also want to establish previous/next
relationships. Note that material immediately before or after
a form can be considered part of the form, e.g., as a title or
notes. The analysis in the illustrative embodiment generally
assumes that the form is syntactically inside or close to the
<FORM> and </FORM> pair, even though form elements
can be located throughout a plain page. The analysis attempts
to make use of adjacency in the HTML source, or in
corresponding tables. Note that a table with headers that
contain "significant and regular" form entries may be considered
a form with table navigation added, whereas a table
with only a few entries might instead be described as a table
with incidental form elements.
2.4 Automatic Grammar Creation 

As noted above, the grammar generator 120 in IVR
platform 102 generates speech grammars from hyperlink
titles and other web page information. This grammar generation
may involve, for example, creating a Grammar
Specification Language (GSL) description of each possible
subset of the title words. The resulting GSL is compiled and
optimized for the speech recognizer 122. In addition, the
vocabulary words used in the grammar are phonetically
transcribed using the TTS synthesizer 116. Additional
details regarding GSL can be found in, for example, M. K.
Brown and J. G. Wilpon, "A Grammar Compiler for Connected
Speech Recognition," IEEE Transactions on Signal
Processing, Vol. 39, No. 1, pp. 17-28, Jan. 1991, which is
incorporated by reference herein.

2.4.1 Combinatorics 

Flexibility may be added to the voice navigation com- 
mands through the use of combinatoric processing, e.g., 
computing all 2^n - 1 possible combinations of the n title words,
while keeping the words in order. This process provides a 
tightly constrained grammar with low perplexity that allows 
all possible word deletions to be spoken, thereby giving the 
user freedom to speak only the smallest set of words 
necessary, e.g., to address a given hyperlink. The process 
can also create many redundancies in the resulting GSL 
description, because leading and trailing words are reused in 
many subsets. The redundancy may be removed when the 
grammars are determinized, as will be described below. 
Small word insertions may be allowed by inserting so-called 
acoustic "garbage" models between words in the hyperlink 
title subsets. This can be done automatically by the grammar 
generator 120. The combinatoric processing may be inhib- 
ited when <GRAMMAR> definitions are encountered. A 
mixture of hyperlink titles and <GRAMMAR> definitions 
can be used on a single page to take advantage of the features 
of each method. 
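
As a small illustration of the combinatoric processing, the following sketch enumerates every non-empty, order-preserving subset of the words in a hyperlink title. The function name `ordered_subsets` is hypothetical, and the sketch produces plain phrase strings rather than GSL; garbage-model insertion is not shown.

```python
from itertools import combinations


def ordered_subsets(title: str):
    """Return all non-empty word subsets of a title, words kept in order."""
    words = title.split()
    n = len(words)
    phrases = []
    for k in range(1, n + 1):
        # combinations() yields index tuples in increasing order,
        # so the original word order is preserved in every subset.
        for idx in combinations(range(n), k):
            phrases.append(" ".join(words[i] for i in idx))
    return phrases
```

For the three-word title "Get voice messages" this yields 2^3 - 1 = 7 phrases, including "Get messages" but not "messages Get". A title with repeated words would produce some duplicate phrases, which is one source of the redundancy that later optimization removes.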

2.4.2 Grammar Compilation 

In the illustrative embodiment, grammar compilation gen- 
erally involves the steps of preprocessing the created GSL to 
include external files, expanding macros, parsing the 
expanded GSL and generating grammar network code. The 
grammar code describes grammar rules that define how 
states of a finite-state network are connected and what labels
are attached to the state transitions. For additional details, 
see M. K. Brown and B. M. Buntschuh, "A Context-Free 
Grammar Compiler for Speech Understanding Systems," 
ICSLP '94, Vol. 1, pp. 21-24, Yokohama, Japan, Sep. 1994,
which is incorporated by reference herein. The resulting 
finite-state network is typically large and redundant, espe- 
cially if most of the GSL is created from hyperlink titles, 
making the grammar inefficient for speech recognition. In 
accordance with the invention, this inefficiency may be 
reduced in four stages of code optimization. 

The first stage involves determinizing the grammar using 
the well-known finite-state network determinization algo- 
rithm. This eliminates all LHS redundancy in the grammar 
rules making the resulting network deterministic in the sense 
that, given an input symbol, the next state is uniquely 
defined. All grammar ambiguity is removed in this stage. 
The second stage of optimization minimizes the number of 
states in the network using the O(n log(n)) group partitioning
algorithm. This eliminates all homomorphic redundancy
while preserving determinism. This is the state-minimal 
description of the grammar, but is not necessarily the most 
efficient representation for speech recognition. The third 
stage of optimization removes all RHS grammar rule redun- 
dancy. This operation does not preserve determinism, but 
does eliminate redundant state transitions. Since state tran- 
sitions carry the word labels that represent word models and 
therefore cause computation, reducing redundancy in these 
transitions is beneficial even though the number of states is 
usually increased in the process. The last stage of optimi- 
zation is the removal of most null, i.e., "epsilon," state 
transitions. Some of these null transitions are created in the 
third stage of optimization. Others can be explicitly created 
by a <GRAMMAR> definition. While null transitions do not 
cost computation, they waste storage and therefore should 
be eliminated. 
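
To make the first optimization stage concrete, the sketch below applies the textbook subset construction to a small word network. It is an illustration of determinization in general, not the compiler described in the cited reference; the data layout (a dictionary from `(state, label)` to a set of next states, with the empty string standing for a null transition) is an assumption chosen for brevity.

```python
from collections import deque

EPS = ""  # label used in this sketch for a null ("epsilon") transition


def eps_closure(states, trans):
    """All states reachable from `states` through null transitions only."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def determinize(start, finals, trans):
    """Subset construction; returns (dstart, dfinals, dtrans)."""
    labels = {lab for (_, lab) in trans if lab != EPS}
    dstart = eps_closure({start}, trans)
    dtrans, dfinals = {}, set()
    seen, queue = {dstart}, deque([dstart])
    while queue:
        cur = queue.popleft()
        if cur & finals:
            dfinals.add(cur)
        for lab in labels:
            nxt = set()
            for s in cur:
                nxt |= trans.get((s, lab), set())
            if not nxt:
                continue
            nxt = eps_closure(nxt, trans)
            dtrans[(cur, lab)] = nxt  # exactly one successor per label
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dstart, dfinals, dtrans


def accepts(dstart, dfinals, dtrans, words):
    """Follow the deterministic network over a word sequence."""
    cur = dstart
    for w in words:
        cur = dtrans.get((cur, w))
        if cur is None:
            return False
    return cur in dfinals


# Network for the phrases "get messages", "get" and "messages";
# the two "get" arcs leaving state 0 are the redundancy to remove.
TRANS = {(0, "get"): {1, 3}, (1, "messages"): {2}, (0, "messages"): {4}}
DSTART, DFINALS, DTRANS = determinize(0, {2, 3, 4}, TRANS)
```

After determinization the two "get" arcs are merged, so each state has at most one successor for a given word. State minimization and the later stages are not shown.
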

It should be noted that in alternative embodiments of the
invention, grammars may be partially or completely precompiled
rather than compiled as the grammars are used.
Such an arrangement may be beneficial for applications in
which, for example, the grammars are very large, such as
name dialing directories, or would otherwise require a long
time for compilation.
2.4.3 Phonetic Transcription 

The above-noted vocabulary words are extracted from the 
grammar definitions during the compilation process. For 
example, each word may be processed in isolation by a 
pronunciation module in the TTS synthesizer 116 to create 
phonetic transcriptions that describe how each word is 
pronounced. This method has the disadvantage of ignoring 
context and possibly mispronouncing a word as a noun 
instead of a verb or vice versa, e.g., object, subject, etc.
Context information may be included in order to provide 
more accurate pronunciation. 
2.5 Voice Interpretation 

In the illustrative embodiment, voice commands may be
interpreted rapidly by using hash tables keyed on the spoken
phrases. This is typically a "many-to-many" mapping from
speech recognizer output text to computer commands or
URLs. If more than one URL and/or command are retrieved
from the table, a disambiguation dialog manager may be
utilized to direct the user to make a unique selection.
Separate hash tables can be maintained for each web page
visited so that grammar recompilation is not necessary when
revisiting a page. This can lead to the creation of many hash
tables, but the table size is typically small, thus making this
an effective method for web page browsing. For large
grammar applications, it may be possible to automatically
create a semantic parser using the grammar compiler. Interpretation
then can be done in two stages, e.g., if a hash table
created from hyperlink titles is found, in a first stage, not to
contain the key phrase, then the semantic parser can be used,
in a second stage, to interpret the phrase.
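
The lookup scheme just described can be sketched with ordinary dictionaries. Everything named here (`PageCommandTable`, `interpret`, the `/voice` and `/email` URLs, and the tuple-shaped results) is hypothetical; the sketch only shows a per-page many-to-many table, the disambiguation case, and the optional second-stage parser.

```python
from collections import defaultdict


class PageCommandTable:
    """Many-to-many map from spoken phrases to URLs or commands."""

    def __init__(self):
        self._table = defaultdict(set)

    def add(self, phrase, target):
        self._table[phrase.lower()].add(target)

    def lookup(self, phrase):
        return sorted(self._table.get(phrase.lower(), ()))


tables = {}  # one table per visited page, keyed by page URL


def interpret(page_url, phrase, fallback_parser=None):
    """First stage: hash lookup. Second stage: optional semantic parser."""
    table = tables.get(page_url)
    targets = table.lookup(phrase) if table else []
    if len(targets) == 1:
        return ("go", targets[0])
    if len(targets) > 1:
        # Several matches: hand off to a disambiguation dialog.
        return ("disambiguate", targets)
    if fallback_parser is not None:
        return ("parsed", fallback_parser(phrase))
    return ("reject", None)


# Example: two links on a hypothetical page share the word "messages".
PAGE = "http://www.anywhere.net/"
tables[PAGE] = PageCommandTable()
tables[PAGE].add("voice messages", "http://www.anywhere.net/voice")
tables[PAGE].add("messages", "http://www.anywhere.net/voice")
tables[PAGE].add("email messages", "http://www.anywhere.net/email")
tables[PAGE].add("messages", "http://www.anywhere.net/email")
```

Because `tables` is keyed by page URL, revisiting a page reuses its existing table instead of rebuilding it.
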
3. General Web-Based IVR Applications 

The IVR platform 102 in accordance with the invention
not only provides a speech-controlled web browser, but can
also be used to allow the general Internet population to build
IVR applications. The advantage of this approach is the
elimination of the need for the individual or small business
user to own any special IVR equipment. As previously
noted, typical IVR platforms are very expensive, and therefore
only moderately large businesses or ISPs can generally
afford to own this equipment. However, since a user can
program applications which utilize the IVR platform 102 by
simply writing HTML, PML or other types of web pages,
while obtaining the IVR platform service from an ISP which
owns that platform, the small business or individual user
does not need to make any large investment in equipment.

As noted previously, each ordinary hyperlink title in a
given page or set of pages may be processed to produce
subgrammars that allow all spoken subsequences for the
words in the title. For general IVR applications, the content
developer can write more complex grammars by, e.g., inserting
a <GRAMMAR> tag, followed by a grammar written in
GSL, followed by a </GRAMMAR> tag. Using this method,
many entirely different phrases can be used to address the
same URL. The use of GSL in these applications is similar
to its normal use for defining speech grammars in other
applications. For example, the local <GRAMMAR> scope
may comprise the entire definition for the current URL.
Included files can contain surrounding grammar definitions.
Macros can be defined either within the local <GRAMMAR>
scope or can reside in included files. All macros
typically have global scope within the web page.

Local applet code and other types of application code in
a web page may be used to give the IVR content developer 
the means to perform operations on either a server or a 
client. In a typical IVR platform application, Java code 
might be used to perform operations at the server that could, 
in turn, control remote devices through the Internet or the 
PSTN using additional hardware at the remote end. Since 
HTML pages on the Internet form an implicit finite-state 
network, this network can be used to create a dialog system. 
The resulting system uses dialog to control the output of web 
page information to the user. Even without an applet 
language, such a dialog system can be built using the 
techniques of the invention. 

More specifically, an IVR web page implemented in such 
a dialog system may include, e.g., possibly null text to be 
spoken to the user when the page is read, a program script 
that would execute operations on the host processor, and a 
possibly silent hyperlink for each appropriate spoken 
response from the user. In addition, there may be other 
hyperlinks that are taken when the speech recognizer rejects 
an utterance as unrecognizable. Using these basic building 
blocks, a dialog system can be constructed. 
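
The building blocks listed above map naturally onto a small finite-state structure. In the sketch below each page carries a prompt (possibly empty), a set of links keyed by the spoken response, and an optional reject link. The `DialogPage` type, the page names and the prompts are all hypothetical, and program scripts are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DialogPage:
    prompt: str                                           # spoken on entry; may be ""
    links: Dict[str, str] = field(default_factory=dict)   # response -> next page
    reject: Optional[str] = None                          # taken on non-recognition


def run_dialog(pages: Dict[str, DialogPage], start: str,
               utterances: List[str]) -> List[str]:
    """Walk the page network and return the prompts that would be spoken."""
    current = start
    spoken = [pages[current].prompt]
    for utt in utterances:
        page = pages[current]
        nxt = page.links.get(utt.lower(), page.reject)
        if nxt is None:
            break  # no matching link and no reject link: dialog ends
        current = nxt
        spoken.append(pages[current].prompt)
    return [p for p in spoken if p]  # silent pages speak nothing


PAGES = {
    "main": DialogPage("Get messages", {"get messages": "choose"}),
    "choose": DialogPage("Do you want voice or email messages?",
                         {"voice": "voice", "email": "email"}, reject="retry"),
    "retry": DialogPage("Please say voice or email.",
                        {"voice": "voice", "email": "email"}, reject="retry"),
    "voice": DialogPage("Reading voice messages."),
    "email": DialogPage("Reading email messages."),
}
```

Calling `run_dialog(PAGES, "main", ["get messages", "mumble", "voice"])` passes through the reject page once before reaching the voice page.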

As a simple example, a representation of the <GRAMMAR>
tag embedded in a hyperlink (e.g., HREF="http://www.anywhere.net/"
GRAMMAR="((get|retrieve|call for) messages)"
TITLE="Get messages") can represent a flexible
set of alternative utterances that a user can say to cause an 
action such as initiating a phone call to the user's answering 
machine. In this case the hyperlink is not silent since the title 
part of the hyperlink is spoken to the user: "Get messages." 
If the title part of the hyperlink is empty, then nothing is 
spoken to the user. The user can respond with "get 
messages," "retrieve messages," or "call for messages" in 
this simple example. By speaking a command and following 
this link to the next web page, the user may then be read text 
on that page, e.g., "Do you want voice or email messages?" 
Two hyperlinks on that page with appropriate speech gram- 
mars would then link to appropriate pages to cause access to 
voice messages or email. A third default link might be taken 
when the utterance is not understood since the speech 
recognizer can be configured to return a token to indicate 
non-recognition. For each of the message choices there may 
be a further set of web pages to deal with functions such as 
reading, saving, deleting messages and responding to messages.
Another example of a representation of a <GRAMMAR>
tag embedded in a hyperlink is
HREF="http://www.anywhere.net/" GRAMMAR_FILE=<URL>. In this
case, the specified URL indicates where the grammar file can 
be found. Many other types of dialog systems can be 
constructed in a similar manner using the techniques of the 
invention. 
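
To show how a grammar string such as the one in the first example above expands into utterances, the following sketch parses a simplified notation with parentheses for grouping, "|" for alternatives and whitespace for sequencing. This notation is inferred from that single example and should not be taken as the full GSL syntax; the function name `expand` is hypothetical.

```python
import re
from itertools import product


def expand(grammar: str):
    """List every phrase accepted by a simple group/alternative grammar."""
    tokens = re.findall(r"[()|]|[^\s()|]+", grammar)
    pos = 0

    def parse_expr():
        # expr := seq ('|' seq)*
        nonlocal pos
        alternatives = parse_seq()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            alternatives += parse_seq()
        return alternatives

    def parse_seq():
        # seq := (WORD | '(' expr ')')*
        nonlocal pos
        phrases = [""]
        while pos < len(tokens) and tokens[pos] not in ("|", ")"):
            if tokens[pos] == "(":
                pos += 1
                choices = parse_expr()
                if pos >= len(tokens) or tokens[pos] != ")":
                    raise ValueError("unbalanced parentheses")
                pos += 1
            else:
                choices = [tokens[pos]]
                pos += 1
            # Concatenate every phrase so far with every new choice.
            phrases = [(a + " " + b).strip()
                       for a, b in product(phrases, choices)]
        return phrases

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result
```

Applied to `"((get|retrieve|call for) messages)"`, the function returns the three phrases named in the text: "get messages", "retrieve messages" and "call for messages".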

The ability to build dialog systems in this manner opens 
up a new class of Internet applications to the general Internet 
population, without requiring content developers to own or 
directly operate an IVR platform as long as they have 
services of an IVR platform available from a service pro- 
vider such as an ISP. As previously noted, this is a drastic 
departure from conventional approaches to providing IVR 
service, which typically require the ownership of expensive 
IVR equipment. An ISP with an IVR platform system will be 
able to sell IVR support services to the general public at 
relatively low cost. Corporations with more demanding 
response requirements may ultimately want to operate their 
own platforms for a limited community of employees, but 
can develop and test their IVR web pages before committing 
to purchase costly equipment. 

The above-described embodiments of the invention are 
intended to be illustrative only. Alternative embodiments 
may incorporate additional features such as, for example,
Optical Character Recognition (OCR) for generating audible 
information from retrieved web pages, analysis of images 
for verbal rendering, e-mail to speech conversion, and 
speaker verification for secure access. These and numerous 
other alternative embodiments within the scope of the fol- 
lowing claims will be apparent to those skilled in the art. 
What is claimed is: 

1. An apparatus for implementing an interactive voice 
response application over a network, the apparatus compris- 
ing: 

a speech synthesizer operative to generate speech output 
characterizing at least a portion of a web page retrieved 
over the network; 

a grammar generator operative to process information in 
the retrieved web page to produce at least a portion of 
at least one grammar; and 

a speech recognizer having an input coupled to an output 
of the grammar generator, wherein the speech recog- 
nizer is operative to utilize the at least one grammar 
produced by the grammar generator to recognize 
speech input; 

wherein the at least one grammar produced by the gram- 
mar generator is utilized by the speech synthesizer to 
create phoneme information, such that similar pho- 
nemes are used in both the speech recognizer and the 
speech synthesizer. 

2. The apparatus of claim 1 wherein the apparatus further 
includes a processor operative to implement a function of at 
least one of the speech synthesizer, the grammar generator 
and the speech recognizer. 

3. The apparatus of claim 1 further including a parser 
which identifies textual information in the retrieved web 
page, and delivers the textual information to the grammar 
generator.

4. The apparatus of claim 1 further including a voice 
processor which is operative to determine which of a set of 
predetermined models best characterizes the retrieved web 
page. 

5. The apparatus of claim 4 wherein the voice processor 
utilizes a default top-down description process if the 
retrieved web page is not adequately characterized by any of 
the predetermined models. 

6. The apparatus of claim 4 wherein the models charac- 
terize structure in the web page including at least one of a 
section heading, a table, a frame, and a form. 

7. The apparatus of claim 4 wherein the voice processor 
applies a plurality of different sets of models to the retrieved 
web page, each of the sets including at least one model. 

8. The apparatus of claim 1 wherein the speech 
synthesizer, the grammar generator and the speech recog- 
nizer are elements of an interactive voice response system 
associated with a service provider.

9. The apparatus of claim 1 wherein the speech synthe- 
sizer operates in a description mode, in which, unless 
interrupted by user input, the synthesizer provides a com- 
plete description of the retrieved web page to a user via the 
audio interface device, and an inspection mode, in which the 
synthesizer provides an abbreviated description of the 
retrieved web page and then awaits inspection command 
input from the user. 

10. The apparatus of claim 1 wherein the speech 
synthesizer, grammar generator and speech recognizer are 
used to implement a dialog system in which a dialog is 
conducted with a user via the audio interface device in order 
to control the output of the web page information to the user. 

11. The apparatus of claim 10 wherein the web page 
includes at least one of (i) text to be read to the user by the 
speech synthesizer, (ii) a program script for executing operations
on a host processor, and (iii) a hyperlink for each of a
set of designated spoken responses which may be received 
from the user. 

12. The apparatus of claim 10 wherein the web page
includes at least one hyperlink that is to be utilized when the 
speech recognizer rejects a given spoken user input as 
unrecognizable.

13. The apparatus of claim 10 wherein at least a portion 
of the grammar produced by the grammar generator is 
precompiled. 

14. A method for implementing an interactive voice 
response application over a network, the method comprising 
the steps of: 

generating speech output characterizing at least a portion 
of a web page retrieved over the network;

processing information in the web page to produce at least
a portion of at least one grammar;

utilizing the grammar to recognize speech input; and

utilizing the grammar to create phoneme information,
such that similar phonemes are used in both the recognizing
and generating steps.

15. The method of claim 14 further including the step of 
determining which of a set of predetermined models best 
characterizes the retrieved web page. 

16. The method of claim 15 further including the step of
utilizing a default top-down description process if the 
retrieved web page is not adequately characterized by any of 
the predetermined models. 

17. The method of claim 15 further including the step of 
applying a plurality of different sets of models to the
retrieved web page, each of the sets including at least one
model. 

18. The method of claim 14 wherein the generating, 
processing and utilizing steps include implementing a dialog
system in which a dialog is conducted with a user in order
to control the output of the web page information to the user. 

19. The method of claim 18 wherein the web page 
includes at least one of (i) text to be read to the user, (ii) a 
program script for executing operations on a host processor,
and (iii) a hyperlink for each of a set of designated spoken
responses which may be received from the user. 

20. The method of claim 18 wherein the web page 
includes at least one hyperlink that is to be utilized when a 
given spoken user input is rejected as unrecognizable. 

21. The method of claim 14 wherein at least a portion of
the grammar produced in the utilizing step is precompiled. 

22. A machine-readable medium for storing one or more
programs for implementing an interactive voice response
application over a network, wherein the one or more programs
when executed by a machine carry out the steps of:

generating speech output characterizing at least a portion
of a web page retrieved over the network;

processing information in the web page to produce at least
a portion of at least one grammar;

utilizing the grammar to recognize speech input; and

utilizing the grammar to create phoneme information,
such that similar phonemes are used in both the recognizing
and generating steps.

23. An interactive voice response system for communicating
information between a network and an audio interface
device, the system comprising:

at least one computer for implementing at least a portion
of an interactive voice response platform, the platform
including:

(i) a speech synthesizer operative to generate speech
output characterizing at least a portion of a web page
retrieved over the network;

(ii) a grammar generator operative to process information
in the retrieved web page to produce at least a
portion of at least one grammar; and 

(iii) a speech recognizer operative to utilize the at least 
one grammar produced by the grammar generator to 
recognize speech input; 

wherein the at least one grammar produced by the 
grammar generator is utilized by the speech synthe- 
sizer to create phoneme information, such that simi- 
lar phonemes are used in both the speech recognizer 
and the speech synthesizer. 

24. The system of claim 23 wherein the interactive voice 
response platform is associated with a service provider. 

25. The system of claim 23 wherein the interactive voice 
response platform implements a dialog system in which a 
dialog is conducted with a user in order to control the output 
of the web page information to the user. 

26. An apparatus for implementing an interactive voice 
response application over a network, the apparatus compris- 
ing: 

a speech synthesizer operative to generate speech output 
characterizing at least a portion of a web page retrieved 
over the network; 

a grammar generator operative to process information in 
the retrieved web page to produce at least a portion of 
at least one grammar; and 

a speech recognizer having an input coupled to an output 
of the grammar generator, wherein the speech recog- 
nizer is operative to utilize the at least one grammar 
produced by the grammar generator to recognize 
speech input; 

wherein the speech synthesizer operates in a description 
mode, in which, unless interrupted by user input, the 
synthesizer provides a complete description of the 
retrieved web page deliverable to a user via an audio 
interface device, and an inspection mode, in which the 
synthesizer provides an abbreviated description of the 
retrieved web page and then awaits inspection com- 
mand input from the user. 

27. A method for implementing an interactive voice 
response application over a network, the method comprising 
the steps of: 

generating speech output characterizing at least a portion 
of a web page retrieved over the network; 

processing information in the web page to produce at least
a portion of at least one grammar; and 

utilizing the grammar to recognize speech input; 

wherein a speech synthesizer used in the generating step 
generates one or more phonetic transcriptions, and the 
phonetic transcriptions are used in the utilizing step to 
recognize the speech input. 
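
As a rough illustration of the steps of claim 27, the following sketch derives grammar entries from the hyperlink text of a page and attaches to each entry the phonetic transcription produced by the same routine a synthesizer would use. It is a sketch under stated assumptions, not the claimed implementation: `LinkTextParser`, `toy_transcribe` and `build_grammar` are hypothetical names, the choice of hyperlink text as the grammar source is an assumption, and `toy_transcribe` is a placeholder that merely spells out letters rather than performing real letter-to-sound conversion.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List


class LinkTextParser(HTMLParser):
    """Collects the visible text of each <a href="..."> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: Dict[str, str] = {}  # spoken phrase -> target URL
        self._href = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace and case so the phrase can serve as a grammar entry.
            phrase = " ".join("".join(self._text).split()).lower()
            if phrase:
                self.links[phrase] = self._href
            self._href = None


def toy_transcribe(word: str) -> List[str]:
    """Placeholder for the synthesizer's text-to-phoneme front end (assumption)."""
    return [ch.upper() for ch in word if ch.isalpha()]


def build_grammar(
    html: str, transcribe: Callable[[str], List[str]] = toy_transcribe
) -> Dict[str, dict]:
    """Map each link phrase to its URL and its per-word phonetic transcription."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return {
        phrase: {"href": href, "phones": [transcribe(w) for w in phrase.split()]}
        for phrase, href in parser.links.items()
    }
```

Because the recognizer would match against the `phones` produced by the synthesizer's own transcription routine, both components work from the same pronunciation of each phrase.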

28. A machine-readable medium for storing one or more 
programs for implementing an interactive voice response 
application over a network, wherein the one or more 
programs when executed by a machine carry out the steps of: 

generating speech output characterizing at least a portion 
of a web page retrieved over the network; 

processing information in the web page to produce at least 
a portion of at least one grammar; and 

utilizing the grammar to recognize speech input; 

wherein a speech synthesizer used in the generating step 
generates one or more phonetic transcriptions, and the 
phonetic transcriptions are used in the utilizing step to 
recognize the speech input. 

29. An interactive voice response system for communi- 
cating information between a network and an audio interface 
device, the system comprising: 

at least one computer for implementing at least a portion 
of an interactive voice response platform, the platform 
including: 

(i) a speech synthesizer operative to generate speech 
output characterizing at least a portion of a web page 
retrieved over the network; 

(ii) a grammar generator operative to process informa- 
tion in the retrieved web page to produce at least a 
portion of at least one grammar; and 

(iii) a speech recognizer operative to utilize the at least 
one grammar produced by the grammar generator to 
recognize speech input; 

wherein the speech synthesizer operates in a descrip- 
tion mode, in which, unless interrupted by user input, 
the synthesizer provides a complete description of 
the retrieved web page deliverable to a user via the 
audio interface device, and an inspection mode, in 
which the synthesizer provides an abbreviated 
description of the retrieved web page and then awaits 
inspection command input from the user. 